(57) Abstract

A method and apparatus for measuring the concentration of an analyte present in a biological fluid is disclosed. The method includes
the steps of applying NIR radiation to calibration samples to produce calibration data, analyzing calibration data to identify and remove 
outliers, constructing a calibration model, collecting and analyzing unknown samples to identify and remove outliers, and predicting analyte 
concentration of non-outliers from the calibration model. Analysis of calibration data includes data pretreatment, data decomposition to 
remove redundant data, and identification and removal of outliers using generalized distances. The apparatus (100) includes a pump (102) 
which circulates a sample through tubing (104) to fill a flowcell (106). Light from a NIR source (114) is synchronized with a detector 
(110), facilitating light and dark measurements, and passes through a monochrometer (120) and the flowcell (106) and strikes the detector 
(110), whereby radiation transmitted through the sample is measured. 

BIOLOGICAL FLUID ANALYSIS USING DISTANCE OUTLIER DETECTION



CROSS-REFERENCE TO PROVISIONAL APPLICATION

Benefit of applicants' prior filed copending provisional application 
number 60/001,950 is hereby claimed. 
BACKGROUND OF THE INVENTION 

Spectral analysis is widely used in identifying and quantitating
analytes in a sample of a material. One form of spectral analysis 
measures the amount of electromagnetic radiation which is absorbed by a 
sample. For example, an infrared spectrophotometer directs a beam of 
infrared radiation towards a sample, and then measures the amount of 
radiation absorbed by the sample over a range of infrared wavelengths. 
An absorbance spectrum may then be plotted which depicts sample 
absorbance as a function of wavelength. The shape of the absorbance 
spectrum, including relative magnitudes and wavelengths of peak 
absorbances, serves as a characteristic 'fingerprint' of particular analytes 
in the sample.

The absorbance spectrum may furnish information useful in 
identifying analytes present in a sample. In addition, the absorbance 
spectrum can also be of use for quantitative analysis of the concentration 
of individual analytes in the sample. In many instances, the absorbance 
of an analyte in a sample is approximately proportional to the 
concentration of the analyte in the sample. In those cases where an 
absorbance spectrum represents the absorbance of a single analyte in a 
sample, the concentration of the analyte may be determined by 
comparing the absorbance of the sample to the absorbance of a 
reference sample at the same wavelengths, where the reference sample 
contains a known concentration of the analyte. 
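
By way of illustration only, the single-analyte comparison described above may be sketched as a simple ratio. The function name and the numerical values below are hypothetical and are not taken from this specification; the sketch assumes absorbance is strictly proportional to concentration at the chosen wavelength.

```python
def concentration_from_reference(a_sample, a_reference, c_reference):
    """Estimate concentration, assuming absorbance is proportional to concentration.

    a_sample    -- absorbance of the unknown sample at one wavelength
    a_reference -- absorbance of the reference sample at the same wavelength
    c_reference -- known analyte concentration of the reference sample
    """
    if a_reference == 0:
        raise ValueError("reference absorbance must be non-zero")
    return c_reference * a_sample / a_reference


# Hypothetical values: a 100 mg/dL reference absorbs 0.50, the sample absorbs 0.75.
print(concentration_from_reference(0.75, 0.50, 100.0))  # 150.0
```

Such a ratio applies only where a single analyte contributes to the measured absorbance.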

One fundamental goal of a near-infrared spectroscopic method for
biological fluid analyte concentration measurements such as blood
glucose levels is to collect high quality data. Although great care may be
taken to ensure reliable measurements by consistent sample preparation
and data acquisition, data generated by instrumentation and clinical
reference testing, like all data, are susceptible to the inclusion of errors
from a number of sources. In large sets of data, it is not uncommon to
have a number of measurements that are extremely deviant from the
expected distribution of measurements, commonly referred to as outliers.
Whether outliers result from statistical errors or systematic errors, outlier
detection identifies samples containing such errors with sufficient
confidence that such samples can be considered unique with respect to
the sampled population. Inclusion of a small number of outliers within a
set of measurements can degrade or destroy a calibration model that
would otherwise be obtained by the measurements.

Referring to the method and apparatus of the present invention, 
there are at least four potential sources of error in the chemometric 
analysis for biological fluid analyte measurements such as measurements 
of blood glucose levels. 

A first source of error is related to sample preparation. Blood
serum samples require a great deal of preparation before chemometric
analysis. During this preparation, a number of factors can affect the
sample. For example, the amount of time that blood samples are allowed
to clot may affect sample continuity in terms of fibrinogen content. The
level of clotting also impacts the quality of centrifugation and ultimately
the decanting of serum from cells. Samples prepared for clinical assays
determine the quality of the data used for reference and calibration, so
that great care must be exercised with the samples since this data will
ultimately define the limit of prediction abilities.

A second source of error may result from the spectral
measurement process. For example, the use of a flowcell for sample
containment during data acquisition is susceptible to problems such as
bubbles in the optical path as well as dilution effects from reference
saline solution carryover. These dilution effects are usually negligible,
but bubbles in the optical path are not infrequent and have a severe
impact on data quality. In addition, errors produced by mechanical or
electronic problems occurring within the analysis instrumentation can
have important effects on data quality.

A third source of error is also related to the reference tests. Errors 
due to out-of-specification instrumental controls and low sample volume 
during clinical assays have similar effects to errors related to sample 
preparation, described above. 

A fourth source of error, and probably the most difficult to identify 
and control, relates to sources of the samples, that is, to the individuals 
providing the biological fluids. A sample taken from an individual may at 
first seem to be quite unique with respect to a previously sampled 
population, but may in fact be an ordinary sample when a larger sample 
population is considered, that is, a putative unique sample may be only 
an artifact of undersampling. 

All of these errors, alone or in combination, can lead to a 
calculated value of biological fluid analyte concentration that is at great 
variance with respect to measurements from samples taken from the 
same individual at approximately the same time. These extremely 
deviant values, which can be orders of magnitude greater or less than a 
predicted mean value, are outliers that should be identified prior to
constructing a model for predicting biological fluid analyte 
concentrations. 

The removal of outliers from a data set can be accomplished in a 
qualitative and subjective sense by graphical inspection of plotted data in 
those cases when the dimensionality is low, that is, where the number of 
data points associated with each measurement is small. In those 
instances where the number of data points associated with each
measurement is large, however, outlier detection may be more quickly
and efficiently accomplished by a number of automatable procedures
such as residual analysis. However, such procedures are often subject to
a number of errors, or at least subject to errors in interpretation,
especially in the relatively high dimensional spaces that are typically
associated with multifactorial chemometric analyses.
SUMMARY OF THE INVENTION 

To ensure accurate and consistent results, chemometric
applications for biological fluid analyte measurement, such as glucose
concentration determination, require multiple measurements taken from a
number of individual test subjects over a period of time. However, even
with consistent sample preparation and data acquisition, natural
variations in samples and unintended errors can diminish the accuracy of
results. Further, these errors are magnified by the relatively small
number of biological fluid samples that can economically be drawn and
tested. Automated techniques for outlier detection are necessary to
assess the suitability of all acquired samples during both the research phase
and final use. The quality of data during clinical studies will define
calibration models and the direction of subsequent research thrusts
dependent upon results. In an end use, visual inspection of acquired data
may or may not be possible. Even if inspection of the data is possible,
independent objective methods of determination are needed which are
not susceptible to subjective biases.

In order to aid in the understanding of the present invention, it can
be stated in essentially summary form that it is directed to a method and
apparatus for measuring biological fluid analyte concentration using
outlier identification and removal based on generalized distances. The
present invention improves the accuracy of biological fluid analyte
concentration determination by identifying outliers and removing
them from data before formation of a calibration model.

The present invention provides a method and apparatus whereby
the concentration of an analyte in a sample of a biological fluid may be
investigated by spectral analysis of electromagnetic radiation applied to
the sample, including collecting calibration data, analyzing the calibration
data to identify and remove outliers using the calibration model,
constructing a calibration model, collecting unknown sample data,
analyzing the unknown sample data to identify and remove outliers, and
predicting analyte concentration of non-outliers in the unknown sample
data by using the calibration model.

The analysis of the calibration data set may include data
pretreatment, data decomposition to remove redundant data, and
identification and removal of outliers as having a low probability of class
membership, using generalized distance methods.

The construction of a calibration model may utilize principal
component regression, partial least squares, multiple linear regression, or
artificial neural networks, whereby the calibration data set may be
reduced to significant factors using principal component analysis or
partial least squares scores, enabling calculation of regression
coefficients and artificial neural network weights.

The unknown sample data may be analyzed using data
pretreatment, followed by projection into the space defined by the
calibration model, and identification and removal of outliers in the
unknown sample data as having a low probability of class membership.
The prediction of analyte concentration of an unknown sample may
include projecting data from the unknown sample into the space defined
by the calibration model, thereby enabling determination of the analyte
concentration.

A first embodiment of the apparatus of the present invention
includes a pump into which a sample is introduced, the pump acting to
circulate the sample through tubing to fill a flowcell, with the pump
capable of both stopped flow and continuous flow operation. A sample
compartment housing containing the flowcell and a detector is
temperature controlled by a temperature control unit. Light from a
relatively broad bandwidth near-infrared source is directed through a
chopper wheel, and the chopper wheel is synchronized by a chopper
synchronization unit with respect to the detector, enabling the
apparatus of the present invention to make both light and dark
measurements to substantially eliminate electronic noise. Modulated light
then passes through a monochrometer, allowing variance of the
wavelength of radiation continuously over an appropriate range. The
monochromatic light passes through the flowcell and strikes the detector,
whereby the amount of light transmitted through the sample is measured.
Measurement data is stored in a general purpose programmable computer
having a general purpose microprocessor, available for further processing
according to the present invention. In addition, the computer may also
control operation of the pump, the temperature control unit, the chopper
synchronization unit, the chopper wheel, and the monochrometer.

In a second embodiment of the apparatus of the present invention,
light from the relatively broad bandwidth light source is directed through
the chopper wheel, and thereafter modulated light is passed through a
filter wheel, whereby discrete wavelengths of radiation may be selected
and transmitted to the flowcell.

In a third embodiment of the apparatus of the present invention, a
plurality of narrow bandwidth near-infrared sources, such as a plurality of
laser diodes, is provided to produce near-infrared radiation at a
preselected plurality of wavelengths. Light from a selected narrow
bandwidth near-infrared source may be pulsed by a driver in
synchronization with the detector and directed into the flowcell 106.
Synchronization of the selected narrow bandwidth near-infrared
source and the detector permits the apparatus to make both light and
dark measurements, thereby substantially eliminating significant
electronic noise. Each of the set of narrow bandwidth near-infrared
sources may be selected for emission of light to be transmitted into the
flowcell in a convenient order, for instance in order of increasing
or decreasing wavelength, by configuring the computer to sequentially
pulse each of the set of narrow bandwidth near-infrared sources.

In computer implementation of the method and apparatus of the 
present invention, variations in the intensity of transmitted light as a 
function of wavelength are converted into digital signals by the detector, 
with the magnitude of the digital signals determined by the intensity of 
the transmitted radiation at the wavelength assigned to that particular 
signal. Thereafter, the digital signals are placed in the memory of the 
computer for processing as will be described. 

The steps of the method of the present invention include as a first
step collecting data to be used in constructing a calibration model. After
the calibration data have been collected, data pretreatment may be
performed in order to remove or compensate for spectral artifacts such as
scattering (multiplicative) effects, baseline shifts, and instrumental noise.
Pretreatment of the calibration data may be selected from the group of
techniques including calculating nth order derivatives of spectral data,
multiplicative scatter correction, n-point smoothing, mean centering,
variance scaling, and the ratiometric method.
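
As a non-limiting sketch, three of the pretreatments named above may be written as follows, assuming the spectra are held in a NumPy array with one sample per row and one wavelength per column. The function names are illustrative only, and the finite-difference derivative and the mean-spectrum reference used for scatter correction are common choices assumed here rather than requirements of the method.

```python
import numpy as np


def mean_center(X):
    """Subtract the mean spectrum (column means) from every sample."""
    return X - X.mean(axis=0)


def first_derivative(X):
    """First-order finite difference along the wavelength axis."""
    return np.diff(X, n=1, axis=1)


def multiplicative_scatter_correction(X):
    """Regress each spectrum on the mean spectrum, then remove offset and slope."""
    ref = X.mean(axis=0)
    corrected = np.empty_like(X, dtype=float)
    for i, row in enumerate(X):
        slope, offset = np.polyfit(ref, row, 1)  # row ~ slope * ref + offset
        corrected[i] = (row - offset) / slope
    return corrected
```

Mean centering removes a common baseline, the derivative suppresses constant offsets within each spectrum, and scatter correction removes sample-specific multiplicative and additive effects.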

WO 97/06418    PCT/US96/12625

Once data pretreatment, if any, has been performed on the raw
calibration data, a calibration model may be formed. As near-infrared
spectral data variables are highly correlated, to reduce the level of
redundant information present, near-infrared spectral calibration data may
be formed into an n×p matrix representing n samples, each measured at p
wavelengths. The n×p matrix may be decomposed by principal
component analysis into a set of n, n-dimensional score vectors formed
into an n×n score matrix, and a set of n, p-dimensional loading vectors
formed into an n×p loading matrix. The score vectors are orthogonal and
represent projections of the n spectral samples into the space defined by
the loading vectors and the major sources of variation.
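
For illustration, one way to obtain such a decomposition is a singular value decomposition. The sketch below assumes NumPy and assumes n is no greater than p, so that the score matrix is n×n and the loading matrix is n×p as described; the function name is hypothetical.

```python
import numpy as np


def pca_decompose(X):
    """Decompose an n x p matrix into scores, loadings and eigenvalues via SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U * s        # one row of scores per sample; columns are orthogonal
    loadings = Vt         # one loading vector per row
    eigenvalues = s ** 2  # sum of squares carried by each component, in descending order
    return scores, loadings, eigenvalues
```

The product of the score matrix and the loading matrix reproduces the original data, so discarding later components discards only the variation they carry.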

Principal component analysis generates a set of n eigenvectors and
a set of n eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. The eigenvalues represent the
variance explained by the associated eigenvectors and can be divided
into two sets. The first q eigenvalues are primary eigenvalues,
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_q$, and account for the significant sources of variations
within the data. The remaining n-q secondary (error) eigenvalues
$\lambda_{q+1} \ge \lambda_{q+2} \ge \dots \ge \lambda_n$ account for residual variance or measurement noise.

The number of primary eigenvalues q may be determined by an
iterative method which compares the qth eigenvalue's variance to the
variance of the pooled error eigenvalues via an F-test. Further, reduced
eigenvalues may be utilized, which weight the eigenvalues by an amount
proportional to the information explained by the associated eigenvectors.
The q score values for each sample are used to represent the original
data during outlier detection, with the original spectra projected into the
n×q dimensioned principal component subspace defined by the loading
matrix.
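
The comparison of each eigenvalue with the pooled error eigenvalues may be sketched as below. The specification does not state the weighting used for reduced eigenvalues; the weight (n - j + 1)(p - j + 1) for the jth eigenvalue is an assumption borrowed from common chemometric practice, and the function name is hypothetical. Comparison of each ratio against a critical F value is left out of the sketch.

```python
import numpy as np


def reduced_eigenvalue_f_ratios(X):
    """F ratio of each reduced eigenvalue to the pooled later (error) eigenvalues.

    Element q-1 of the result compares the q-th eigenvalue with eigenvalues
    q+1 .. s, where s = min(n, p).
    """
    n, p = X.shape
    eig = np.linalg.svd(X, compute_uv=False) ** 2
    s = eig.size
    j = np.arange(1, s + 1)
    weights = (n - j + 1) * (p - j + 1)  # assumed reduced-eigenvalue weighting
    reduced = eig / weights
    ratios = []
    for q in range(1, s):
        pooled = eig[q:].sum() / weights[q:].sum()
        ratios.append(reduced[q - 1] / pooled)
    return np.array(ratios)
```

A large ratio suggests the qth eigenvalue is primary, while a ratio near unity suggests it is indistinguishable from the error eigenvalues.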

Outliers may be identified using generalized distances, such as
Mahalanobis distance or Robust distance. A generalized distance
between a sample and the centroid defined by a set of samples may be
determined using the variance-covariance matrix of the set of samples.
Where the true variance-covariance matrix and the true centroid of a
complete set of samples are unknown, a subset of the complete set may
be used to form an approximate variance-covariance matrix and an
approximate centroid. Further, by using principal component scores to
represent spectral data for each sample, independent variables
maximizing the information content may be obtained, ensuring an
invertible approximate variance-covariance matrix. With respect to
Mahalanobis distance, an approximate centroid may be determined as the
centroid of a multivariate normal distribution of the set of calibration
samples, together with an approximate variance-covariance matrix of the set of
calibration samples, whereby approximate Mahalanobis distances in
units of standard deviations measured between the centroid and each
calibration sample may be found. With respect to Robust distance, by
utilizing a minimum volume ellipsoid estimator (MVE), robust estimates of
an approximate variance-covariance matrix and an approximate centroid
may be obtained. Alternatively, a projection algorithm may be used to
determine the Robust distance for each calibration sample.
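
A minimal sketch of the Mahalanobis distance calculation on principal component scores follows, assuming NumPy and the ordinary sample centroid and sample variance-covariance matrix. The Robust distance would substitute robust estimates, such as those from a minimum volume ellipsoid estimator, which are not implemented here. The function name is illustrative.

```python
import numpy as np


def mahalanobis_distances(scores):
    """Distance of each sample (row) from the centroid, in standard-deviation units."""
    centroid = scores.mean(axis=0)
    centered = scores - centroid
    # np.cov returns a 0-d array for a single variable, hence atleast_2d
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    cov_inv = np.linalg.inv(cov)
    d_squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(d_squared)
```

Because the centroid and covariance are estimated from the same samples, a gross outlier inflates the covariance and partly masks itself, which is one motivation for the Robust distance.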

After determining generalized distances for the calibration samples, 
the probability of class membership may be determined by a number of 
techniques, including evaluation of a chi-squared distribution function or 
utilizing Hotelling's T-statistic. Outliers are identified as having relatively 
large generalized distance which results in a relatively low probability of 
class membership. Samples whose class membership can be rejected at 
a confidence level that is greater than approximately 3-5σ may be
considered as outliers. Following identification, outliers in the calibration 
samples may be removed. The generalized distances of outliers removed 
from the calibration samples may be examined, to determine whether 
additional data pretreatment is necessary. In the event that a relatively 
large number of outliers have very large generalized distances, further 
pretreatment of the calibration data may be indicated. After such 
additional pretreatment, the calibration data may again be subjected to 
analysis. On the other hand, if relatively large numbers of outliers do not 
have very large generalized distances, then additional data pretreatment 
may not be appropriate. 
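
By way of example, the chi-squared evaluation may be sketched with the Python standard library alone. The sketch assumes that the squared generalized distance of a sample represented by q scores approximately follows a chi-squared distribution with q degrees of freedom, uses a closed form that is exact only for even q, and interprets a "k sigma" confidence level as the two-sided tail probability of a normal distribution. That interpretation, and all function names, are assumptions made for illustration.

```python
import math


def chi2_survival_even_dof(x, dof):
    """P(chi-squared with `dof` degrees of freedom > x); closed form for even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form needs a positive, even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def sigma_to_tail_probability(k_sigma):
    """Two-sided normal tail probability beyond k standard deviations."""
    return math.erfc(k_sigma / math.sqrt(2.0))


def is_outlier(distance, dof, k_sigma=3.0):
    """Flag a sample whose class-membership probability is below the k-sigma tail."""
    probability = chi2_survival_even_dof(distance ** 2, dof)
    return probability < sigma_to_tail_probability(k_sigma)
```

For odd numbers of scores, or for Hotelling's statistic, a general statistical library would be needed in place of the closed form.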

A calibration model may then be constructed utilizing any of a 
number of techniques, including principal component regression (PCR), 
partial least squares (PLS), multiple linear regression (MLR), and artificial 
neural networks (ANN). The calibration model will seek to correlate a set
of independent variables representing absorbance values of n samples
each measured at p wavelengths, with a set of dependent or response
variables representing the concentration of an analyte in each of the n
samples, by using a p-dimensional regression coefficient vector. A
calibration model determines the regression coefficient vector and is used to
predict the concentration of the analyte in other samples, given only the
absorbances at the p wavelengths.

As noted, near-infrared spectral data variables are highly correlated 
and while careful selection of the measurement wavelengths may 
minimize singularity problems, the spectral regions of interest may suffer 
from severe overlap and a high number of wavelengths is needed to 
model a multicomponent system. Data compression may be used to
address problems with collinearity in determining the regression coefficient
vector, so that redundant data may be reduced down to significant
factors. Principal component regression is one technique that
incorporates a data compression method. The technique of partial least 
squares may also be used to address the problem of redundant data. 
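
Principal component regression may be sketched as follows, assuming NumPy, mean-centered data, and a chosen number of factors q. The function names are hypothetical, and this is one conventional formulation rather than the only way the regression coefficient vector might be obtained.

```python
import numpy as np


def pcr_fit(X, y, q):
    """Principal component regression keeping the first q factors.

    Returns the wavelength means, the concentration mean, and the
    p-dimensional regression coefficient vector b.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    # Regress y on the first q scores, then map back to wavelength space
    b = Vt[:q].T @ ((U[:, :q].T @ (y - y_mean)) / s[:q])
    return x_mean, y_mean, b


def pcr_predict(model, X_new):
    """Predict concentrations from absorbances at the p wavelengths."""
    x_mean, y_mean, b = model
    return (X_new - x_mean) @ b + y_mean
```

Restricting the regression to q scores avoids inverting a nearly singular matrix of collinear absorbances.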

With respect to both principal component regression and partial 
least squares, a determination is made of the appropriate number of score 
vectors or factors to be included in a calibration model that adequately 
represents the calibration data. The goal of selecting the optimal number of
factors for regression is to obtain parsimonious models with robust 
predictive abilities. Including too few factors causes model performance 
to suffer due to inadequate information during calibration, while including 
too many factors may also degrade performance. Principal components 
are normally sorted into an order so that the amount of variation 
explained by each principal component monotonically decreases. Later 
ordered principal components associated with small eigenvalues may be 
considered as containing measurement noise. By utilizing only the first q
factors and omitting remaining factors, a type of noise rejection may be
incorporated within principal component regression. The number of
principal component analysis or partial least squares scores or factors to
use during the regression step may be determined using the standard
error of prediction, a measure of the error associated with each set of
predictions. By plotting standard error of prediction against the number
of factors used in each of the respective sets of predictions, a piecewise
continuous graphical representation may be obtained and utilized to 
determine the number of factors to retain. One criterion for factor 
selection is to determine the first local minimum. Another technique for 
factor selection uses an F-test to compare standard error of prediction 
from models using differing numbers of factors. 
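
The first-local-minimum criterion may be illustrated as below. The function name and the example error values are hypothetical and are not results from this specification; the input is assumed to list the standard error of prediction for 1, 2, 3, ... factors in order.

```python
def first_local_minimum(sep_by_factor_count):
    """Return the 1-based number of factors at the first local minimum of SEP."""
    sep = list(sep_by_factor_count)
    if not sep:
        raise ValueError("need at least one SEP value")
    for i in range(len(sep) - 1):
        if sep[i + 1] > sep[i]:
            return i + 1  # error rises after this many factors
    return len(sep)  # error never increased: keep all factors tried


# Hypothetical SEP values for models with 1 to 5 factors
print(first_local_minimum([9.0, 6.0, 4.5, 4.8, 4.1]))  # 3
```

In this hypothetical case a lower error occurs at five factors, but the criterion stops at three in favor of the more parsimonious model.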

In certain instances, data being analyzed may not be amenable to
being split into a calibration (training) set and a validation (test) set. The
reason may be a limited number of available samples, or that by
splitting data into two sets, one or both of the resulting sets do not
adequately represent the sample population. In such situations, the
iterative technique of leave-one-out cross-validation may be used where,
during each iteration, a sample is excluded from the calibration set and is 
used as a test sample. Prediction models using factors determined from 
calibration samples are then used to make test sample predictions. The 
test sample is then returned to the calibration set and another sample is 
excluded. The same process is repeated until all samples have been 
excluded from the calibration set and predicted by models generated by 
the calibration samples. All predictions are accumulated to give a 
standard error of validation. 
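
The leave-one-out procedure may be sketched as follows, assuming NumPy and a principal component regression model refitted at each iteration. Taking the standard error of validation as the root mean square of the accumulated prediction errors is an assumption, as the specification does not give a formula, and the function names are illustrative.

```python
import numpy as np


def pcr_fit(X, y, q):
    """Principal component regression with q factors (means and coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    b = Vt[:q].T @ ((U[:, :q].T @ (y - y_mean)) / s[:q])
    return x_mean, y_mean, b


def leave_one_out_sev(X, y, q):
    """Standard error of validation from leave-one-out cross-validation."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # exclude sample i from calibration
        x_mean, y_mean, b = pcr_fit(X[keep], y[keep], q)
        errors[i] = (X[i] - x_mean) @ b + y_mean - y[i]
    return float(np.sqrt(np.mean(errors ** 2)))
```

Repeating the calculation for each candidate number of factors gives the error curve from which the number of factors may be chosen.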

Subsequent to determining the number of significant factors, the 
data set for the calibration model may be reduced to significant factors, 
and regression coefficients for the calibration model may be determined. 

After construction of the calibration model, the calibration model
may be applied to data collected from samples where concentration of
analytes of interest are unknown. The unknown sample data may be
appropriately pretreated and then projected into the principal component
space defined by the calibration model. Next, generalized distances for
the unknown sample data set may be found, using, for instance, either
Mahalanobis or Robust distance as utilized with respect to the calibration
data, and the probability of class membership may be estimated using the
techniques described above, including evaluation of a chi-squared
distribution function or utilizing Hotelling's T-statistic. Outliers in the
unknown sample data are then identified based upon rejecting class
membership at a confidence level that is greater than approximately 3-5σ.
As the final steps of the method of the present invention, in the event
that an unknown sample is not an outlier, the sample is projected into the
space defined by the calibration model, and a prediction of the
concentration of the analyte made. On the other hand, if the unknown
sample is an outlier, the unknown sample may be rejected and no
prediction as to analyte concentration made, although if possible,
remeasurement of the unknown sample may be made to verify that the
sample is an outlier.
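
The screening of an unknown sample may be sketched as below, assuming NumPy, mean-centered calibration data, the Mahalanobis distance, and a fixed distance cutoff standing in for the probability test. The names `build_screen`, `screen_and_predict`, and `max_distance` are hypothetical, and the regression coefficient vector b and concentration mean are assumed to come from a previously constructed calibration model.

```python
import numpy as np


def build_screen(X_cal, q):
    """Store what is needed to project new spectra and measure their distance."""
    mean = X_cal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    loadings = Vt[:q]
    scores = (X_cal - mean) @ loadings.T
    cov_inv = np.linalg.inv(np.atleast_2d(np.cov(scores, rowvar=False)))
    return mean, loadings, cov_inv


def screen_and_predict(x_new, screen, b, y_mean, max_distance):
    """Return (prediction, distance); prediction is None for an outlier."""
    mean, loadings, cov_inv = screen
    t = (x_new - mean) @ loadings.T  # project into calibration PC space
    # Calibration scores are mean-centered, so their centroid is the origin
    distance = float(np.sqrt(t @ cov_inv @ t))
    if distance > max_distance:
        return None, distance  # outlier: no prediction is made
    return float((x_new - mean) @ b + y_mean), distance
```

A rejected sample may then be remeasured, where possible, before it is finally treated as an outlier.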

With respect to the apparatus of the present invention, the steps
previously described with respect to the method of the present invention
may be configured on the general purpose microprocessor of the
computer by employing computer program code segments according to
each of such steps.

As those skilled in the art will appreciate, the present invention is 
intended to encompass without limitation a range of embodiments that 
can be better understood with reference to the drawings and following 
detailed description of the preferred embodiments of the invention. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a first preferred 
embodiment of the apparatus for biological fluid analyte concentration 
measurement representing the present invention. 

FIG. 2 is a schematic block diagram of a second preferred
embodiment of the apparatus for biological fluid analyte concentration 
measurement representing the present invention. 

FIG. 3 is a schematic block diagram of a third preferred 
embodiment of the apparatus for biological fluid analyte concentration 
measurement representing the present invention. 

FIG. 4 is a flowchart representing initial steps of the method for 
biological fluid analyte concentration measurement representing the 
present invention. 

FIG. 5 is a flowchart representing intermediate steps of the method 
for biological fluid analyte concentration measurement representing the 
present invention.

FIG. 6 is a flowchart representing final steps of the method for 
biological fluid analyte concentration measurement representing the 
present invention. 

FIG. 7 is a scatter plot of principal component 2 versus principal
component 1 of near-infrared spectra from 111 blood glucose samples in
the range of 1580 nm to 1848 nm.

FIG. 8 is a scatter plot of principal component 2 versus principal
component 1 of near-infrared spectra from 111 blood glucose samples in
the range of 2030 nm to 2398 nm.

FIG. 9 is a scatter plot of principal component 3 versus principal
component 2 of near-infrared spectra from 111 blood glucose samples in
the range of 2030 nm to 2398 nm.

FIG. 10 is a bar graph of calculated Mahalanobis distances for 103
blood glucose samples in the range of 1100 nm to 2398 nm taken from
data depicted in FIGS. 7-9.

FIG. 11 is a scatter plot of predicted blood glucose concentrations
from 103 samples using data derived from 2030 nm to 2398 nm,
generated from a partial least squares model optimized with twelve
factors attaining a standard error of validation of 64.10 mg/dL versus
actual blood glucose concentrations.

FIG. 12 is a scatter plot of predicted blood glucose concentrations
from 100 samples using data derived from 2030 nm to 2398 nm,
generated from a partial least squares model optimized with eight factors
attaining a standard error of validation of 27.43 mg/dL versus actual
blood glucose concentrations.

FIG. 13 is a bar graph of calculated Mahalanobis distances for 100
blood glucose samples in the range of 1580 nm to 1848 nm taken from
data depicted in FIGS. 7-9.

FIG. 14 is a bar graph of calculated Mahalanobis distances for 100 
blood glucose samples in the range of 2030 nm to 2398 nm taken from 
data depicted in FIGS. 7-9. 

FIG. 15 is a scatter plot of predicted blood glucose concentrations
from 95 samples using data derived from 2030 nm to 2398 nm,
generated from a partial least squares model optimized with eight factors
attaining a standard error of validation of 26.97 mg/dL versus actual
blood glucose concentrations.

FIG. 16 is a table representing a summary of outlier detection
results for 111 blood glucose samples over the spectral ranges 1580 nm
to 1848 nm and 2030 nm to 2398 nm utilizing the present invention, and
indicating possible causes of sample error.

FIG. 17 is a graph of the standard error of prediction versus the
numbers of factors used during regression.
DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following portion of the specification, taken in conjunction 
with the drawings, sets forth the preferred embodiments of the present 
invention. The embodiments of the invention disclosed herein are the 
best modes contemplated by the inventors for carrying out their invention
in a commercial environment, although it should be understood that
various modifications can be accomplished within the parameters of the
present invention. 

Referring now to the drawings for a detailed description of the 
present invention, reference is first made to FIG. 1, depicting a first 
preferred embodiment of an apparatus for biological fluid analyte 
concentration measurement. In apparatus 100, a biological fluid sample
may be introduced into pump 102 which circulates the sample through
tubing 104 to fill flowcell 106. Pump 102 may be capable of both
stopped flow and continuous flow operation. Sample compartment 108
contains flowcell 106 and detector 110, and is temperature controlled by
temperature control unit 112. Light from relatively broad bandwidth
near-infrared source 114 is directed through chopper wheel 116.
Chopper wheel 116 is synchronized by chopper synchronization unit 118
with respect to detector 110, enabling apparatus 100 to make both
light and dark measurements to substantially eliminate electronic noise.
Modulated light then passes through monochrometer 120, allowing
continuous variance of the wavelength of radiation over an appropriate
range. The monochromatic light passes through flowcell 106 and strikes
detector 110. Detector 110 measures the amount of light transmitted
through the sample. Measurement data is then stored in general purpose
programmable computer 124 having a general purpose microprocessor,
where the data will be available for further processing as will be 
described. In addition, computer 124 may also control operation of 
pump 102, temperature control unit 112, chopper synchronization 
unit 118, chopper wheel 116, and monochrometer 120. 
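
As an illustrative sketch only, the light and dark readings may be combined into an absorbance value as follows. The specification states that both readings are taken to suppress electronic noise but gives no formula; subtracting the dark reading and taking the ratio to a similarly corrected reference measurement (for example, of a reference solution in the flowcell) is an assumption, as are the function and parameter names.

```python
import math


def absorbance(sample_light, sample_dark, reference_light, reference_dark):
    """Absorbance from dark-corrected detector readings of sample and reference."""
    sample = sample_light - sample_dark
    reference = reference_light - reference_dark
    if sample <= 0 or reference <= 0:
        raise ValueError("dark-corrected signals must be positive")
    return -math.log10(sample / reference)
```

Repeating the calculation at each wavelength selected by the monochrometer would give the absorbance spectrum passed to the computer for processing.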

In a second embodiment of apparatus 100 as depicted in FIG. 2, 
light from relatively broad bandwidth source 114 is directed through
chopper wheel 116, and thereafter the modulated light is passed through
filter wheel 130 whereby discrete wavelengths of radiation may be 
selected and transmitted to flowcell 106. 

In a third embodiment of apparatus 100 of the present invention as
depicted in FIG. 3, a plurality of narrow bandwidth near-infrared
sources 134, such as a plurality of laser diodes, is provided to produce
near-infrared radiation at a preselected plurality of wavelengths. Light
from a selected narrow bandwidth near-infrared source 134 may be
pulsed by driver 138 in synchronization with detector 110 and directed
into flowcell 106. Synchronization of the selected narrow bandwidth
near-infrared source 134 and detector 110 permits apparatus 100 to
make both light and dark measurements, thereby substantially eliminating
electronic noise. Each of the set of narrow bandwidth near-infrared
sources 134 may be selected for emission of light to be transmitted into
flowcell 106 in a convenient order, for instance in order of
increasing or decreasing wavelength, by configuring computer 124 to
sequentially pulse each of the set of narrow bandwidth near-infrared
sources.

Referring to FIGS. 1-3, in computer implementation of the 
apparatus and method of the present invention, variations in the intensity 
of transmitted light as a function of wavelength are converted into digital 
signals by the detector, with the magnitude of the digital signals
determined by the intensity of the transmitted radiation at the wavelength
assigned to that particular signal. Thereafter, the digital signals are 
placed in the memory of computer 124, for processing as will be 
described. 

As symbolically depicted in FIG. 4, step 1 in the method of the
present invention refers to collecting data to be used in performing
calibration and thereafter constructing a calibration model. After the
calibration data have been collected, data pretreatment of step 2 may be
performed, as it is often necessary to pretreat raw spectral data prior to
data analysis or calibration model building in order to remove or
compensate for spectral artifacts such as scattering (multiplicative)
effects, baseline shifts, and instrumental noise. Pretreatment of the
calibration data may be selected from the group of techniques including
calculating nth order derivatives of spectral data, multiplicative scatter
correction, n-point smoothing, mean centering, variance scaling, and the
ratiometric method.
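
Several of the pretreatment techniques named above can be sketched as follows. This is a minimal illustration, assuming the raw spectra are held in an n x p NumPy array with one spectrum per row; the function names are hypothetical, and the first derivative is approximated here by a simple finite difference rather than by any particular method of the present disclosure.

```python
import numpy as np

def mean_center(X):
    """Subtract the per-wavelength (column) mean from every spectrum."""
    return X - X.mean(axis=0)

def variance_scale(X):
    """Divide each column by its sample standard deviation."""
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return X / std

def first_derivative(X):
    """First-order finite difference along the wavelength axis."""
    return np.diff(X, n=1, axis=1)

def multiplicative_scatter_correction(X):
    """Regress each spectrum on the mean spectrum; remove offset and slope."""
    reference = X.mean(axis=0)
    corrected = np.empty(X.shape, dtype=float)
    for i, spectrum in enumerate(X):
        slope, offset = np.polyfit(reference, spectrum, 1)
        corrected[i] = (spectrum - offset) / slope
    return corrected
```

Spectra that differ only by a multiplicative scale and an additive offset collapse onto a common shape after multiplicative scatter correction, which is why that step may be useful before outlier detection.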

Once data pretreatment, if any, has been performed on the raw
calibration data, steps directed towards forming a calibration model may
be taken. With reference to step 3 as depicted in FIG. 4, near-infrared
spectral data variables are highly correlated. To reduce the level of
redundant information present, near-infrared spectral calibration data may
be formed into an nxp matrix X representing n samples, each measured at
p wavelengths, and may be decomposed by principal component analysis
into a set of n, n-dimensional score vectors formed into an nxn score
matrix T, and a set of n, p-dimensional loading vectors formed into an
nxp loading matrix L, with

$$ X = TL \tag{1} $$

In most spectroscopic applications, p > n, so that decomposition may be 
considered decomposing matrix X of rank n into a sum of n rank 1 
matrices. The score vectors represent projections of the n spectral 
samples in X into the space defined by the loading vectors. The score 
matrix T represents the major sources of variation found within X, and 
the column vectors in T are orthogonal. 
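
The decomposition of X into scores and loadings can be sketched with a singular value decomposition. This is an illustrative assumption rather than the procedure of the present disclosure; the function name `pca_decompose` is hypothetical, and any mean centering is assumed to have been done during pretreatment.

```python
import numpy as np

def pca_decompose(X):
    """Return scores T (n x n), loadings L (n x p), and eigenvalues for p >= n."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s            # score vectors as columns, mutually orthogonal
    L = Vt               # loading vectors as rows, orthonormal
    eigenvalues = s ** 2 # sorted in decreasing order
    return T, L, eigenvalues
```

With this construction the product of T and L reproduces X, and the squared singular values play the role of the eigenvalues discussed in the following paragraph.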

Referring to steps 4 and 5 as depicted in FIG. 4, principal
component analysis generates a set of n eigenvectors and a set of n
eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$. The eigenvalues represent the variance
explained by the associated eigenvectors. The eigenvalues may be
divided into two sets. The first q eigenvalues are primary eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_q$ and account for the significant sources of variations within
the data. The remaining n-q secondary, or error, eigenvalues
$\lambda_{q+1} \ge \lambda_{q+2} \ge \dots \ge \lambda_n$ account for residual variance or measurement noise.

With reference to steps 6 and 7 of FIG. 4, the number of primary
eigenvalues q may be determined by an iterative method which compares
the q-th eigenvalue's variance to the variance of the pooled error
eigenvalues via an F-test,

$$ F(1, n-q) = \frac{\lambda_q}{\sum_{j=q+1}^{n} \lambda_j}\,(n-q). \tag{2} $$

In addition, reduced eigenvalues, which weight the eigenvalues by an
amount proportional to the information explained by the associated
eigenvectors, may be utilized, with the reduced eigenvalue defined as

$$ \lambda_q^{*} = \frac{\lambda_q}{(n-q+1)(p-q+1)} \tag{3} $$

so that equation 2 may be expressed as

$$ F(1, n-q) = \frac{\sum_{j=q+1}^{n} (p-j+1)(n-j+1)}{(p-q+1)(n-q+1)} \cdot \frac{\lambda_q}{\sum_{j=q+1}^{n} \lambda_j}. \tag{4} $$



The i-th sample in the principal component subspace is represented by the
q score values of $t_i$. The q score values for each sample are used to
represent the original data during outlier detection. In doing so, the
original spectra are projected into the nxq dimensioned principal
component subspace defined by loading matrix L.
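
The F statistic built from reduced eigenvalues can be sketched as below. The sketch assumes the reduced-eigenvalue form F(1, n-q) = [sum over j from q+1 to n of (p-j+1)(n-j+1)] / [(p-q+1)(n-q+1)] times lambda_q / (sum over j from q+1 to n of lambda_j); the function name is hypothetical, and comparison against a critical F value is left to the caller.

```python
import numpy as np

def reduced_eigenvalue_f(eigenvalues, p):
    """F(1, n-q) for q = 1 .. n-1, given n eigenvalues in decreasing order."""
    lam = np.asarray(eigenvalues, dtype=float)
    n = lam.size
    f_stats = np.empty(n - 1)
    for q in range(1, n):
        j = np.arange(q + 1, n + 1)  # indices of the pooled error eigenvalues
        weight = np.sum((p - j + 1) * (n - j + 1)) / ((p - q + 1) * (n - q + 1))
        f_stats[q - 1] = weight * lam[q - 1] / lam[q:].sum()
    return f_stats
```

A large value at a given q suggests that the q-th eigenvalue carries more variance than the pooled error eigenvalues and so may be treated as primary.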

As depicted symbolically in steps 8 and 9 of FIG. 4, outliers may
be identified using generalized distances, such as Mahalanobis distance
or Robust distance. A generalized distance between a centroid $\mu$ of a set
of samples and the i-th sample $x_i$ may be determined from

$$ D_i = [ (x_i - \mu) \Sigma^{-1} (x_i - \mu)^t ]^{1/2} \tag{5} $$

where $\Sigma$ is the variance-covariance matrix of the set of samples. Where
the true variance-covariance matrix and the true centroid of a complete
set of samples are not determinable, a subset of the complete set of
samples may be used to form an approximate variance-covariance matrix
and an approximate centroid. In addition, by using principal component
scores to represent spectral data for each sample, independent variables
are orthogonal, thus maximizing the information content and ensuring an
invertible approximate variance-covariance matrix.

Generalized distances may be Mahalanobis distances as described
in step 10a of FIG. 4, with an approximate centroid $\bar{x}$ determined as the
centroid of a multivariate normal distribution of the set of calibration
samples and an approximate variance-covariance matrix S of the set of
calibration samples. An approximate Mahalanobis distance $MD_i$, in
units of standard deviations, measured between the centroid and an i-th
calibration sample $x_i$ may thus be determined from

$$ MD_i = [ (x_i - \bar{x}) S^{-1} (x_i - \bar{x})^t ]^{1/2} \tag{6} $$

where

$$ S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^t (x_i - \bar{x})}{n-1}. \tag{7} $$
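
The Mahalanobis distance of each calibration sample from the sample centroid, with the variance-covariance matrix estimated from the same samples using an n-1 divisor, can be sketched as follows. The input is assumed to be an n x q array of principal component scores, and the function name is hypothetical.

```python
import numpy as np

def mahalanobis_distances(scores):
    """Distance of each row of an n x q score array from the sample centroid."""
    centroid = scores.mean(axis=0)
    centered = scores - centroid
    # Sample variance-covariance matrix with an n-1 divisor.
    S = centered.T @ centered / (scores.shape[0] - 1)
    S_inv = np.linalg.inv(S)
    # Quadratic form (x_i - centroid) S^-1 (x_i - centroid)^t for every row i.
    d_squared = np.einsum("ij,jk,ik->i", centered, S_inv, centered)
    return np.sqrt(d_squared)
```

Because the scores are used in place of the raw, highly correlated spectra, the estimated variance-covariance matrix is small and generally invertible.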



With respect to Robust distance as depicted in step 10b of FIG. 4,
by utilizing a minimum volume ellipsoid estimator (MVE), robust estimates
of the approximate variance-covariance matrix $S_{Robust}$ and the approximate
centroid $\bar{x}_{Robust}$ may be obtained, with Robust distances $RD_i$ for the i-th
calibration sample determined from

$$ RD_i = [ (x_i - \bar{x}_{Robust}) S_{Robust}^{-1} (x_i - \bar{x}_{Robust})^t ]^{1/2}. \tag{8} $$

Alternatively, a projection algorithm may be used to determine the Robust
distance $RD_i$ for the i-th calibration sample from

$$ RD_i = \max_{g} \frac{\left| x_i v_g^t - L(x_1 v_g^t, \dots, x_n v_g^t) \right|}{S(x_1 v_g^t, \dots, x_n v_g^t)} \tag{9} $$

for g = 1,...,n, and where a scale S of a minimum volume ellipsoid is given
by

(10) [equation illegible in source]

and a location L of a minimum volume ellipsoid is given by

(11) [equation illegible in source]

where $x_i$ is a p-dimensional vector representing the i-th calibration sample, and $v_g$ is
a p-dimensional vector representing the g-th calibration sample defined by

$$ v_g = x_g - M \tag{12} $$

where M is a p-dimensional vector such that the r-th component of M is
given by the median of a set formed by the r-th component of each of the
n vectors $x_i$. For each value of g = 1,...,n, index j used in equations 10
and 11 is determined from

(13) [equation illegible in source]

where $x_1 v_g^t \le x_2 v_g^t \le x_3 v_g^t \le \dots \le x_n v_g^t$.

After determining the generalized distances for the calibration
samples, referring to step 11 shown in FIG. 4, the probability of class
membership may be determined by a number of techniques, including
evaluation of a chi-squared distribution function or utilizing Hotelling's
T-statistic. As depicted in step 12, outliers are identified as having a
relatively large generalized distance, which results in a relatively low
probability of class membership. Generally speaking, samples whose
class membership can be rejected at a confidence level in the range of
approximately 3-5σ may be considered as outliers. Following
identification, outliers in the calibration samples may be removed as 
depicted in step 13. Further, as indicated in step 14, the generalized 
distances of outliers removed from the calibration samples may be 
examined to determine whether additional data pretreatment is 
necessary. In the event that a relatively large number of outliers have 
very large generalized distances, further pretreatment of the calibration 
data may be indicated. If such further pretreatment of the calibration 
data is indicated, then after such pretreatment, the calibration data will 
again be subjected to the steps previously described beginning at step 2.
On the other hand, if relatively large numbers of outliers do not have very 
large distances, then additional data pretreatment may not be 
appropriate. 
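
Steps 12 through 14 can be sketched as a simple threshold on the generalized distance. This is a minimal illustration that assumes the distances are already expressed in units of standard deviations and uses a fixed cutoff in place of a chi-squared or Hotelling evaluation; the function name and the default threshold of 3.0 are assumptions chosen to fall within the 3-5σ range stated above.

```python
import numpy as np

def remove_outliers(X, distances, threshold=3.0):
    """Split calibration rows into retained samples and flagged outliers."""
    distances = np.asarray(distances, dtype=float)
    is_outlier = distances > threshold
    retained = X[~is_outlier]
    outlier_indices = np.flatnonzero(is_outlier)
    # Fraction of flagged samples, which may hint at a need for more pretreatment.
    outlier_fraction = is_outlier.mean()
    return retained, outlier_indices, outlier_fraction
```

The returned fraction and the distances of the flagged samples may then be examined to decide whether to return to the pretreatment step.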

Thereafter, as indicated by step 15 shown in FIG. 5, a calibration 
model may be constructed utilizing any of a number of techniques, 
including principal component regression (PCR), partial least squares
(PLS), multiple linear regression (MLR), and artificial neural networks 
(ANN). The calibration model will seek to correlate a set of independent 
variables representing absorbance values of n samples measured at p 
wavelengths, symbolically represented by the nxp matrix X, with a set of 
dependent or response variables representing the concentration of an 
analyte in each of the n samples, symbolically represented by vector y. y 
is an n-dimensional vector, or alternatively, may be considered to be an 
nx1 matrix. After mean centering X and y, the relationship between X
and y may be expressed as

$$ y = Xb + e \tag{14} $$

where b represents a p-dimensional regression coefficient vector (px1
matrix) and e is an n-dimensional vector (nx1 matrix) representing errors
in y. The calibration model determines vector b, using

$$ b = (X^t X)^{-1} X^t y. \tag{15} $$

Knowledge of b is used to predict the concentration of the analyte, y, in
unknown samples, given only absorbances at each of the p wavelengths.
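
The least squares solution for b after mean centering can be sketched as follows. The sketch assumes more samples than wavelengths and a well-conditioned X, which is often not the case for spectral data; the function names are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve b = (X^t X)^-1 X^t y on mean-centered X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve the normal equations without forming the inverse explicitly.
    b = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    return b, x_mean, y_mean

def predict_least_squares(X_new, b, x_mean, y_mean):
    """Predict concentration from absorbances using the fitted coefficients."""
    return (X_new - x_mean) @ b + y_mean
```

When the columns of X are nearly collinear, the linear solve becomes ill-conditioned, which motivates the data compression discussed next.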

Referring to step 16, the determination of $(X^t X)^{-1}$ may be difficult
as collinearity is inherent in spectroscopic data. As described,
near-infrared spectral data variables are highly correlated. While careful
selection of the measurement wavelengths may minimize singularity
problems, the spectral regions of interest may suffer from severe overlap
and a high number of wavelengths is needed to model a multicomponent
system. Data compression may be used to address problems with
collinearity in determining regression coefficient vector b, so that
redundant data may be reduced down to significant factors.

Principal component regression is one technique to determine
vector b that incorporates a data compression method. The first step in
principal component regression is to perform principal component
analysis on the calibration data as formed into matrix X. The score matrix
T represents the major sources of variation found within X, and the
column vectors in T are orthogonal. As a result, in the next step in
principal component regression, T is used in place of X, whereby an
approximate value of b is found using

$$ b = (T^t T)^{-1} T^t y \tag{16} $$

as $(T^t T)$ is invertible.
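
Principal component regression with q retained factors can be sketched as below. The sketch assumes the scores are obtained by singular value decomposition of the mean-centered data and maps the score-space coefficients back to wavelength space through the loadings, a step added here for convenience when predicting; the function names are hypothetical.

```python
import numpy as np

def fit_pcr(X, y, q):
    """Regress y on the first q principal component scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :q] * s[:q]     # n x q score matrix
    L = Vt[:q]               # q x p loading matrix
    b_scores = np.linalg.solve(T.T @ T, T.T @ yc)
    b = L.T @ b_scores       # coefficients expressed per wavelength
    return b, x_mean, y_mean

def predict_pcr(X_new, b, x_mean, y_mean):
    """Predict concentration for new spectra from the fitted PCR model."""
    return (X_new - x_mean) @ b + y_mean
```

Because the score columns are orthogonal, the q x q system is diagonal and remains solvable even when there are more wavelengths than samples.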

The techniques of partial least squares may also be used to
address the problem of redundant data. One difference between partial
least squares and principal component regression is the way in which the
score matrix T and the loading matrix L are generated. As described, in
principal component regression using non-linear iterative partial least
squares (NIPALS), loading vectors are extracted one at a time in the order
of their contribution to the variance in X. As each loading vector is
determined, it is removed from X and the next loading vector is
determined. This process is repeated until n loadings have been
determined. In partial least squares, concentration (y block) information
is used during iterative decomposition of X. With concentration
information incorporated into L, T values are related to concentration,
and useful predictive information is placed into earlier factors as
compared to principal component regression.

With respect to both principal component regression and partial
least squares, determination must be made of the appropriate number of 
score vectors or factors to be included in a calibration model that 
adequately represents the calibration data. The goal of selecting optimal 
number of factors for regression is to obtain parsimonious models with 
robust predictive abilities. Including too few factors causes model 
performance to suffer due to inadequate information during calibration. 
Including too many factors may also degrade performance. Principal 
components are normally sorted into an order so that the amount of 
variation explained by each principal component monotonically 
decreases. Later ordered principal components associated with small 
eigenvalues may be considered as containing measurement noise. By 
utilizing only the first q factors and omitting the remaining factors, a type 
of noise rejection may be incorporated within principal component 
regression. The number of principal component analysis or partial least 
squares scores or factors, q, to use during the regression step may be
determined as follows. In the case of matrix X with rank n, n preliminary
calibration models are built. Each preliminary calibration model uses a
different number of score vectors selected from the range of 1 through n
score vectors. Predictions are then made from the n preliminary
calibration models using the standard error of prediction technique. The
standard error of prediction (SEP) is a measure of the error associated
with each set of predictions and is given by

(17) [equation illegible in source; SEP(k) is formed from a sum over i = 1,...,n divided by n-1]

where the number of test set samples is given by n and

(18) [equation illegible in source]

By plotting standard error of prediction against the number of factors
(score vectors) used, denoted by k, in each of the respective sets of
predictions, a piecewise continuous graphical representation such as FIG.
17 may be obtained and utilized to determine the number of factors to
retain. One criterion for factor selection is to determine the first local 
minimum. Applying a first local minimum criterion to the data graphed in 
FIG. 17, eight factors would be selected for the calibration model. A 
general interpretation of FIG. 17 is that significant information is being 
incorporated into the calibration model in factors one through six. As 
factors seven and eight are included, subtleties in the data are included. 
For factors nine through fifteen, variations or measurement noise specific 
to the calibration set are being modeled, so errors increase. Another 
technique for factor selection uses an F-test to compare standard error of
prediction from models using differing numbers of factors. An F-test
factor optimization would find that the standard error of prediction of an
eight factor model does not vary significantly from the standard error of
prediction of a six factor model, whereby six factors is seen to be
optimal.
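
The first local minimum criterion can be sketched as follows. Because the standard error of prediction formula is not legible in the source, the sketch assumes a common form, the square root of the summed squared prediction residuals divided by n-1; both function names are hypothetical.

```python
import numpy as np

def standard_error_of_prediction(y_true, y_pred):
    """Assumed form: sqrt(sum of squared residuals / (n - 1))."""
    residuals = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.sum(residuals ** 2) / (residuals.size - 1))

def first_local_minimum(sep_by_k):
    """Return the factor count k (1-based) at the first local minimum of SEP(k)."""
    sep = np.asarray(sep_by_k, dtype=float)
    for idx in range(sep.size - 1):
        if sep[idx] < sep[idx + 1]:
            return idx + 1
    return sep.size  # SEP never increased, so keep all tested factors
```

An F-test comparison between candidate models, as described above, may then be applied to check whether a model with fewer factors performs comparably.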

In certain instances, data being analyzed may not be amenable to 
being split into a calibration, training set and a validation, test set. The 
reason may be due to a limited number of available samples or that by 
splitting data into two sets, one or both of the resulting sets do not 
adequately represent the sample population. The technique of leave one 
out cross validation may be used in such a situation. Leave one out 
cross validation is an iterative process, where during each iteration, a 
sample is excluded from the calibration set and is used as a test sample. 
Prediction models using 1 through n-1 factors determined from n-1
calibration samples are then used to make test sample predictions. The 
test sample is then returned to the calibration set and another sample is 
excluded. The same process is repeated until all n samples have been 
excluded from the calibration set and predicted by models generated by 
the n-1 calibration samples. All predictions are accumulated to give the 
standard error of validation (SEV) given by

(19) [equation illegible in source; SEV(k) is formed from a sum over the n leave one out predictions divided by n-1]

where the subscript (i) represents the i-th leave one out iteration, which
leaves out the i-th sample, with the standard error of validation then
treated as standard error of prediction.
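
Leave one out cross validation can be sketched as below, here using principal component regression as the prediction model. The error measure is an assumption, namely the square root of the summed squared leave one out residuals divided by n-1, and the function names are hypothetical. The caller is assumed to choose a maximum factor count no larger than the rank of the mean-centered calibration subset.

```python
import numpy as np

def _pcr_coefficients(Xc, yc, k):
    """PCR coefficients in wavelength space from centered data and k factors."""
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # With T = U*s, (T^t T)^-1 T^t y reduces to (U^t y) / s.
    return Vt[:k].T @ ((U[:, :k].T @ yc) / s[:k])

def leave_one_out_sev(X, y, max_factors):
    """Return SEV(k) for k = 1 .. max_factors."""
    n = X.shape[0]
    squared_error = np.zeros(max_factors)
    for i in range(n):
        keep = np.arange(n) != i           # exclude the i-th sample
        x_mean, y_mean = X[keep].mean(axis=0), y[keep].mean()
        Xc, yc = X[keep] - x_mean, y[keep] - y_mean
        for k in range(1, max_factors + 1):
            b = _pcr_coefficients(Xc, yc, k)
            prediction = (X[i] - x_mean) @ b + y_mean
            squared_error[k - 1] += (prediction - y[i]) ** 2
    return np.sqrt(squared_error / (n - 1))
```

The resulting values may be plotted against k and treated in the same manner as the standard error of prediction.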

Referring to step 17, as depicted in FIG. 5, after determining the 
number of significant factors, data for the calibration model may be 
reduced to significant factors, and regression coefficients for the 
calibration model may be determined. 

After construction, the calibration model as described above may
be applied to data collected from samples where concentrations of
analytes of interest are unknown, symbolically indicated in FIG. 6 as step
18. The unknown sample data may be appropriately pretreated, as
indicated at step 19, with similar techniques to those described above
with respect to pretreatment techniques capable of use with calibration
data. Upon completion of pretreatment, the sample data may be
projected into the principal component space that was previously defined
by the calibration model, as indicated in step 20. In step 21, the
generalized distance for the unknown sample is found using the same
generalized distance, such as Mahalanobis or Robust distance, that was
utilized with respect to the calibration data. The probability of class
membership may be estimated using the techniques described above, including
evaluation of a chi-squared distribution function or utilizing Hotelling's
T-statistic. Referring next to step 22, unknown sample outliers may then
be identified based upon rejecting class membership at a confidence level
that is in the approximate range of 3-5σ. In the event that an unknown
sample is not an outlier, as in step 23a, the unknown sample may be
projected into the space defined by the calibration model, and a
prediction of the concentration of the analyte may be made. However, if
the unknown sample is an outlier, as in step 23b, the unknown sample
should be rejected and no prediction as to analyte concentration is made,
although if possible, remeasurement of the unknown sample may be
made for reanalysis to verify that the unknown sample is indeed an
outlier.
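
Steps 20 through 23 can be sketched as a single screening and prediction routine. The sketch assumes a principal component regression model with q factors, a Mahalanobis distance computed in the score space of the mean-centered calibration data, and a fixed distance threshold in place of a probability evaluation; the function names, the dictionary layout, and the default threshold are all hypothetical.

```python
import numpy as np

def build_model(X, y, q):
    """Fit a q-factor PCR model and keep what is needed to screen new samples."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    T = U[:, :q] * s[:q]
    return {
        "x_mean": x_mean,
        "y_mean": y_mean,
        "loadings": Vt[:q],
        # Centered scores are uncorrelated, so their covariance is diagonal.
        "score_var": s[:q] ** 2 / (X.shape[0] - 1),
        "b_scores": (T.T @ (y - y_mean)) / s[:q] ** 2,
    }

def predict_or_reject(model, x_new, threshold=3.0):
    """Return (prediction, distance); prediction is None for a flagged outlier."""
    t = (x_new - model["x_mean"]) @ model["loadings"].T   # step 20: project
    distance = np.sqrt(np.sum(t ** 2 / model["score_var"]))  # step 21
    if distance > threshold:                              # steps 22 and 23b
        return None, distance
    return t @ model["b_scores"] + model["y_mean"], distance  # step 23a
```

A rejected sample yields no concentration estimate, consistent with the recommendation above that the sample be remeasured where possible.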

With respect to the apparatus of the present invention, it will be
understood that the steps previously described with respect to the
method of the present invention may be configured on the general
purpose microprocessor of computer 124 by employing computer
program code segments according to each of such steps.

In use, the method and apparatus of the present invention were
applied to blood glucose concentration data obtained from samples from
111 individuals. Six of the samples did not have enough serum to collect
a near-infrared spectrum, so that vectors of zeros were used to fill their
position within the data matrix in order to maintain succession number
integrity during data manipulation. The six samples and the associated
reference tests were omitted from future analyses. Two other samples
were associated with reference test errors and were omitted, leaving 103
samples in the data set.

Potential outliers were identified through visual inspection of two 
dimensional and three dimensional scatter plots of principal component 
scores. FIGS. 7-9 depict separate principal component analyses of two 
spectral regions performed. Vectors of zeros, indicated by reference 
numeral 200, lie far from the main group of data, as expected. The near- 
infrared spectra of three samples, indicated by reference numerals 23, 
67, and 83, each exhibited indications of interference due to bubbles in 
the optical path of the flowcell. As shown in FIGS. 7-9, such
interference was present across the spectrum utilized, as shown by the
distance of samples 23, 67, and 83 from the main group. In FIG. 7,
samples 28 and 44 are seen to be potential outliers, as are samples 3 
and 4 in FIG. 8. In FIG. 9, samples 3, 4, and 44 are potential outliers. 

Mahalanobis distances were calculated for the 103 samples, as
shown in FIG. 10, wherein samples 23, 67, and 83 are seen to have 
Mahalanobis distances much greater than the other samples. Further, in 
FIGS. 10, 13, and 14, omitted samples are depicted as having zero 
Mahalanobis distance. A number of additional samples appear in FIG. 10 
to be outlier candidates, including samples 3, 4, and 44. The data were 
subjected to further analyses, as will be described, with samples 23, 67, 
and 83 omitted, leaving 100 samples in the data set. 

The detrimental impact of including outlier samples in a data set is
illustrated in FIGS. 11 and 12. FIG. 11 depicts a scatter plot of predicted
blood glucose concentrations from 103 samples using data derived from
2030 nm to 2398 nm, generated from a partial least squares model
optimized with twelve factors attaining a standard error of validation of
64.10 mg/dL, versus actual blood glucose concentrations. With samples
23, 67, and 83 removed, FIG. 12 depicts a scatter plot of predicted
blood glucose concentrations from 100 samples using data derived from
2030 nm to 2398 nm, generated from a partial least squares model
optimized with eight factors attaining a standard error of validation of
27.43 mg/dL, versus actual blood glucose concentrations. With gross
outliers eliminated, the partial least squares technique utilized in the
method of the present invention was able to make better predictions and
use a less complex model, that is, a model using fewer factors. The
sample depicted in FIG. 11 having a predicted value of approximately
750 mg/dL corresponded to the sample indicated by reference numeral 83. If
sample 83 in FIG. 11 is ignored and the remaining samples in FIG. 11 are
compared with those in FIG. 12, it is apparent that there is a wider
spread of data about the identity line in FIG. 11. These results illustrate
the influence of a relatively small number of outliers in seriously
degrading the overall performance of a calibration model.

Two spectral regions of the 100 samples were tested separately
for outliers, with Mahalanobis distances for each of the regions shown in
FIGS. 13 and 14. Nine samples were flagged as possible outliers in the
1580 nm to 1848 nm region, and six samples were flagged in the 2030
nm to 2398 nm region as possible outliers. As is apparent from
comparison of FIGS. 13 and 14, the flagged samples were different in
the two spectral regions. Outliers may be selected to be those flagged
samples that are excluded from class membership in either or both
spectral ranges, at a confidence level selected to be in the range of 3-5σ.
Four of the samples rejected were also identified as possible outliers from
the principal component score plots, FIGS. 7-9. Identification of the fifth
sample required examination in the higher dimensional space associated
with Mahalanobis distances.

FIG. 16 sets forth a summary of 95 samples representing both
major spectral regions examined using the method and apparatus of the
present invention, and shows that blood glucose concentration
predictions using the 95 and 100 sample data sets and the same spectral
regions yielded very similar results. A slight reduction in prediction error
to SEV of 26.97 mg/dL with respect to the 100 sample set depicted in
FIG. 12 resulted for the 95 sample set depicted in FIG. 15 for the 2030
nm to 2398 nm region, with the difference representing approximately a
1% reduction in error. An F-test at the 95% confidence level did not find
this to be a significant difference. Comparison of the partial least squares
results from other spectral regions with various forms of data
preprocessing yielded similar findings.

If a Mahalanobis distance threshold of 3.0 is used to determine 
outliers, a set of 89 samples results. Utilizing a partial least squares 
technique, leave-one-out validation on the set of 89 samples resulted in
an SEV of 27.95 mg/dL, a slight increase over the 100 and 95 sample
sets. It was separately determined that the six samples omitted in the 89 
sample set with respect to the 95 sample set corresponded to samples 
having a high triglyceride concentration, a high total protein value, or 
both. The presence of the six outliers constituted an artifact of 
undersampling, that is, if a greater number of representative samples 
with high triglyceride or total protein concentrations were present in the 
original set of samples, samples having high triglyceride or total protein 
concentrations would be less likely to be flagged as outliers. 

Sensitivity of outlier detection to triglyceride or any other analyte 
which affects spectral response may be advantageous, however. 
Spectral data may be partitioned such that samples with high 
triglycerides form a first calibration set while samples with low 
triglycerides form a second calibration set, so that new samples may be 
tested with the method and apparatus of the present invention to
determine whether the first or second calibration set is representative of
the new sample, thus allowing the selection of a prediction model 
determined from "similar" calibration spectra. 

The present invention having been described in its preferred
embodiments, it is clear that it is susceptible to numerous modifications
and embodiments within the ability of those skilled in the art and without
the exercise of the inventive faculty. As will be appreciated by those
skilled in the art, the method and apparatus of the present invention
encompass alternative biological fluid analyte measurement techniques,
including biological fluid analyte concentrations derived using light
reflectance, light transmission, and other techniques used in conjunction
with invasive, non-invasive, and in-vivo biological fluid analyte
measurement techniques. In addition, measurements of biological fluid
analytes may also include triglycerides, cholesterol, and serum proteins,
with outlier detection using the method and apparatus of the present
invention.

WHAT IS CLAIMED IS:

1. An improved method for forming a calibration model for use
in determining concentration of an analyte of a biological fluid of a
mammal, comprising the steps of:

collecting a set of calibration samples from a plurality of sources of
the biological fluid;

generating near-infrared electromagnetic radiation having a plurality
of wavelengths;

irradiating each of the calibration samples with the radiation so
that a portion of the radiation at each of the wavelengths is transmitted
through each of the calibration samples;

measuring intensity of the radiation transmitted through each of
the calibration samples at each of the wavelengths thereby forming a set
of calibration data;

processing the set of calibration data, including forming the set of 
calibration data into a nxp matrix defining a space, wherein n is the 
number of calibration samples and p is the number of wavelengths at 
which intensity of transmitted radiation is measured, forming a subspace 
of the space wherein sources of relatively greater variations within the 
set of calibration data are represented, projecting the set of calibration 
data into the subspace, determining a generalized distance within the 
subspace between each calibration sample and a centroid of a 
distribution formed by the set of calibration samples, identifying 
calibration outliers as those calibration samples having a generalized 
distance greater than a preselected magnitude, forming a reduced set of 
calibration samples from calibration samples remaining after removal of 
calibration outliers; and 

constructing a calibration model from the reduced set of calibration 
samples to predict concentration of the analyte in an unknown sample of 
the biological fluid.

2. The method as set forth in claim 1, wherein:

the step of forming a subspace includes decomposing the matrix 
by principal component analysis into an nxn dimensional score matrix and 
an nxp dimensional loading matrix, generating by principal component 
analysis a set of n eigenvectors and a set of n eigenvalues associated 
with the eigenvectors and arranged in order of decreasing magnitude, 
dividing the set of eigenvalues into a set of q larger, primary eigenvalues 
and a set of n-q smaller, error eigenvalues whereby the primary 
eigenvalues are associated with relatively more significant sources of 
variations within the set of calibration data and the error eigenvalues are 
associated with relatively less significant sources of variation within the 
set of calibration data, and generating the subspace as an nxq 
dimensioned principal component subspace from the space defined by 
the loading matrix; and 

the step of constructing a calibration model includes forming a 
regression coefficient matrix correlating the reduced set of calibration 
samples with the concentration of the analyte in the reduced set of 
calibration samples whereby the regression coefficient matrix may be 
used to predict concentration of the analyte in an unknown sample of the 
biological fluid given the intensity of the radiation transmitted 
therethrough at each of the wavelengths.

3. The method as set forth in claims 1 or 2, wherein each of
the generalized distances is a Mahalanobis distance determined from the
following relationship:

$$ MD_i = [ (x_i - \bar{x}) S^{-1} (x_i - \bar{x})^t ]^{1/2} $$

wherein $MD_i$ is the Mahalanobis distance between an i-th calibration
sample $x_i$ and the centroid $\bar{x}$ of the set of calibration samples, $S^{-1}$ is the
inverted variance-covariance matrix of the set of calibration data, and
$(x_i - \bar{x})^t$ is the transpose of $(x_i - \bar{x})$.

4. The method as set forth in claims 1 or 2, wherein each
generalized distance is a Robust distance determined using an algorithm
selected from the group consisting of minimum volume ellipsoid estimator
and projection algorithm.

5. The method as set forth in claims 1 or 2, further including 
the step of pretreating the set of calibration data to remove and 
compensate for spectral artifacts prior to the step of processing the set 
of calibration data. 

6. The method as set forth in claim 5, wherein the step of 
pretreating the set of calibration data is performed using an algorithm 
selected from the group consisting of nth order derivatives, multiplicative 
scatter correction, n-point smoothing, mean centering, variance scaling,
and ratiometric method. 

7. The method as set forth in claims 1 or 2, further including
the steps of: 

forming a ratio of the number of calibration outliers to the number 
of calibration samples; 

determining whether the ratio is greater than a preselected ratio; 

and 

pretreating the set of calibration data to remove and compensate 
for spectral artifacts prior to the step of processing the set of calibration 
data if the ratio exceeds the preselected ratio. 
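The decision in claim 7 reduces to a single comparison, sketched below. The function name `needs_pretreatment` and its parameters are hypothetical; the claims do not give a value for the preselected ratio.

```python
def needs_pretreatment(n_outliers, n_samples, preselected_ratio):
    """Return True when the outlier fraction exceeds the preselected ratio."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return (n_outliers / n_samples) > preselected_ratio
```

For example, 3 outliers among 20 calibration samples give a ratio of 0.15, which would trigger pretreatment if the preselected ratio were 0.1.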

8. The method as set forth in claims 1 or 2, wherein the step 
of identifying calibration outliers includes selecting the magnitude by 
determining a probability that each member of the set of calibration 
samples belongs to a class defined by a preselected probability 
distribution function whereby calibration outliers are identified as 
calibration samples whose class membership may be rejected at a 
confidence level greater than a preselected level. 

9. The method as set forth in claim 8, wherein the probability 
distribution function is formed using an algorithm selected from the group 
consisting of chi-squared distribution function evaluation and Hotelling's 
T-statistic evaluation. 

10. The method as set forth in claim 8, wherein the preselected 
level is in the range of approximately 3 to 5 standard deviations as 
defined by the probability distribution function. 
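One possible reading of claims 8 to 10 is sketched below: a level stated in standard deviations is converted to a confidence level, and the distance cutoff is taken from a chi-squared distribution whose degrees of freedom equal the subspace dimension. Both that mapping and the Wilson-Hilferty approximation used for the chi-squared quantile are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, dof):
    """Approximate chi-squared quantile (Wilson-Hilferty cube-root approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3

def distance_cutoff(n_sigma, n_components):
    """Distance above which class membership is rejected at the n_sigma level."""
    # Assumption: "n_sigma standard deviations" means the two-sided normal coverage.
    confidence = 2.0 * NormalDist().cdf(n_sigma) - 1.0
    # Squared Mahalanobis distances are treated as approximately chi-squared.
    return math.sqrt(chi2_quantile(confidence, n_components))
```

The approximation is rough for very few degrees of freedom; an exact quantile function from a statistics library could be substituted where available.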

11. An improved method for determining concentration of an 
analyte of a biological fluid of a mammal, comprising the steps of: 

collecting a set of calibration samples from a plurality of sources of 
the biological fluid and an unknown sample from an unknown source of 
the biological fluid; 

generating near-infrared electromagnetic radiation having a plurality 
of wavelengths; 

irradiating each of the calibration samples and the unknown sample 
with the radiation so that a portion of the radiation at each of the 
wavelengths is transmitted through each of the calibration samples and 
the unknown sample; 

measuring intensity of the radiation transmitted through each of 
the calibration samples at each of the wavelengths thereby forming a set 
of calibration data and through the unknown sample at each of the 
wavelengths thereby forming a set of sample data; 

processing the set of calibration data, including forming the set of 
calibration data into an nxp matrix defining a space, wherein n is the 
number of calibration samples and p is the number of wavelengths at 
which intensity of transmitted radiation is measured, forming a subspace 
of the space wherein sources of relatively greater variations within the 
set of calibration data are represented, projecting the set of calibration 
data into the subspace, determining a generalized distance within the 
subspace between each calibration sample and a centroid of a 
distribution formed by the set of calibration samples, identifying 
calibration outliers as those calibration samples having a generalized 
distance greater than a preselected magnitude, forming a reduced set of 
calibration samples from calibration samples remaining after removal of 
calibration outliers; 

constructing a calibration model from the reduced set of calibration 
samples to predict concentration of the analyte in the unknown sample; 
and 

applying the calibration model to the set of sample data including 
projecting the set of sample data into the space defined by the model, 
determining a generalized distance for the unknown sample according to 
the model, identifying the unknown sample as a sample outlier provided 
the generalized distance of the unknown sample is greater than the 
preselected magnitude, and predicting concentration of the analyte in the 
unknown sample according to the model provided the generalized 
distance of the unknown sample is not greater than the preselected 
magnitude. 

12. The method as set forth in claim 11, wherein: 
the step of forming a subspace includes decomposing the matrix 
by principal component analysis into an nxn dimensional score matrix and 
an nxp dimensional loading matrix, generating by principal component 
analysis a set of n eigenvectors and a set of n eigenvalues associated 
with the eigenvectors and arranged in order of decreasing magnitude, 
dividing the set of eigenvalues into a set of q larger, primary eigenvalues 
and a set of n-q smaller, error eigenvalues whereby the primary 
eigenvalues are associated with relatively more significant sources of 
variations within the set of calibration data and the error eigenvalues are 
associated with relatively less significant sources of variation within the 
set of calibration data, and generating the subspace as an nxq 
dimensioned principal component subspace from the space defined by 
the loading matrix; and 

the step of constructing a calibration model includes forming a 
regression coefficient matrix correlating the reduced set of calibration 
samples with the concentration of the analyte in the reduced set of 
calibration samples whereby the regression coefficient matrix may be 
used to predict concentration of the analyte in an unknown sample of the 
biological fluid given the intensity of the radiation transmitted 
therethrough at each of the wavelengths. 
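The decomposition and regression steps of claim 12 might be sketched as below, using a singular value decomposition to obtain scores, loadings, and eigenvalues in decreasing order, followed by principal component regression on the first q score vectors. This is an illustration under stated assumptions: the function names are hypothetical, the data are mean centered first, and the economy-size decomposition returns at most min(n, p) components rather than the matrix dimensions recited in the claim.

```python
import numpy as np

def pca_decompose(X):
    """Return mean spectrum, score matrix, loading matrix, and eigenvalues (descending)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)  # variance captured by each component
    scores = U * s                           # projection of each sample on the loadings
    return mean, scores, Vt, eigenvalues

def pcr_coefficients(X, y, q):
    """Regression coefficients relating spectra to concentration using q primary components."""
    mean, scores, loadings, _ = pca_decompose(X)
    y = np.asarray(y, dtype=float)
    y_mean = y.mean()
    gamma, *_ = np.linalg.lstsq(scores[:, :q], y - y_mean, rcond=None)
    b = loadings[:q].T @ gamma               # map back to one coefficient per wavelength
    return mean, y_mean, b

def pcr_predict(x, mean, y_mean, b):
    """Predict concentration from transmitted intensities at each wavelength."""
    return (np.asarray(x, dtype=float) - mean) @ b + y_mean
```

Here `X` stands for the reduced set of calibration spectra and `y` for the known analyte concentrations of those samples.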

13. The method as set forth in claims 11 or 12, wherein each of 
the generalized distances of the set of calibration samples is a 
Mahalanobis distance determined from the following relationship: 

$$MD_i = \left[ (x_i - \bar{x})\, S^{-1}\, (x_i - \bar{x})^{T} \right]^{1/2}$$

wherein $MD_i$ is the Mahalanobis distance between an $i$th calibration 
sample $x_i$ and the centroid $\bar{x}$ of the set of calibration samples, $S^{-1}$ is the 
inverted variance-covariance matrix of the set of calibration data, and 
$(x_i - \bar{x})^{T}$ is the transpose of $(x_i - \bar{x})$, and wherein the generalized distance of 
the unknown sample according to the model is a Mahalanobis distance 
determined from the following relationship: 

$$MD_{sample} = \left[ (x_{sample} - \bar{x}_{model})\, S^{-1}_{model}\, (x_{sample} - \bar{x}_{model})^{T} \right]^{1/2}$$

wherein $MD_{sample}$ is the Mahalanobis distance between the unknown 
sample and the centroid $\bar{x}_{model}$ of the model, $S^{-1}_{model}$ is the inverted 
variance-covariance matrix of the model, and $(x_{sample} - \bar{x}_{model})^{T}$ is the 
transpose of $(x_{sample} - \bar{x}_{model})$. 
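The second relationship in claim 13, together with the outlier gate of claim 11, might look like the following sketch. It assumes the unknown sample has already been projected into the model's subspace; the function names, the `cutoff` parameter, and the `predict` callback are hypothetical.

```python
import numpy as np

def fit_distance_model(calibration_scores):
    """Store the model centroid and inverted variance-covariance matrix."""
    T = np.asarray(calibration_scores, dtype=float)
    centroid = T.mean(axis=0)
    S_inv = np.linalg.pinv(np.atleast_2d(np.cov(T, rowvar=False)))
    return centroid, S_inv

def sample_distance(sample_scores, centroid, S_inv):
    """Mahalanobis distance between an unknown sample and the model centroid."""
    d = np.asarray(sample_scores, dtype=float) - centroid
    return float(np.sqrt(d @ S_inv @ d))

def predict_if_inlier(sample_scores, centroid, S_inv, cutoff, predict):
    """Return (distance, prediction); prediction is None for a sample outlier."""
    md = sample_distance(sample_scores, centroid, S_inv)
    if md > cutoff:
        return md, None  # flagged as a sample outlier, no concentration reported
    return md, predict(sample_scores)
```

Returning the distance alongside the prediction lets a caller report how close a sample was to the cutoff.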

14. The method as set forth in claims 11 or 12, wherein each of 
the generalized distances of the set of calibration data is a Robust 
distance determined using an algorithm selected from the group 
consisting of minimum volume ellipsoid estimator and projection 
algorithm. 
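The claims name a projection algorithm without defining it. A simplified projection-based robust distance is sketched below as one plausible form: each sample is projected onto directions running from the coordinate-wise median to every sample, and its distance is the largest deviation from the median of the projections, scaled by the median absolute deviation. The choice of directions, the median/MAD estimators, and the 1.4826 consistency factor are assumptions of this sketch.

```python
import numpy as np

def projection_robust_distances(X):
    """Largest robustly standardized deviation of each sample over a set of projections."""
    X = np.asarray(X, dtype=float)
    directions = X - np.median(X, axis=0)        # one candidate direction per sample
    norms = np.linalg.norm(directions, axis=1)
    directions = directions[norms > 0] / norms[norms > 0, None]
    proj = X @ directions.T                      # proj[i, l]: sample i along direction l
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)  # robust scale per direction
    keep = mad > 0                               # skip directions with degenerate scale
    outlyingness = np.abs(proj[:, keep] - med[keep]) / (1.4826 * mad[keep])
    return outlyingness.max(axis=1)
```

Because medians are used instead of means, a few gross outliers have little influence on the centre and scale against which every sample is judged.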

15. The method as set forth in claims 11 or 12, further including 
the steps of: 

forming a ratio of the number of calibration outliers to the number 
of calibration samples; 

determining whether the ratio is greater than a preselected ratio; 

pretreating the set of calibration data to remove and compensate 
for spectral artifacts prior to the step of processing the set of calibration 
data if the ratio exceeds the preselected ratio; and 

pretreating the sample data to remove and compensate for spectral 
artifacts prior to the step of applying the calibration model to the sample 
data if the ratio exceeds the preselected ratio. 

16. The method as set forth in claims 11 or 12, further including 
the steps of: 

pretreating the set of calibration data to remove and compensate 
for spectral artifacts prior to the step of processing the set of calibration 
data; and 

pretreating the sample data to remove and compensate for spectral 
artifacts prior to the step of applying the calibration model to the sample 
data. 

17. The method as set forth in claim 16, wherein the steps of 
pretreating the set of sample data and pretreating the set of calibration 
data are each performed using an algorithm selected from the group 
consisting of nth order derivatives, multiplicative scatter correction, 
n-point smoothing, mean centering, variance scaling, and ratiometric 
method. 

18. The method as set forth in claims 11 or 12, wherein the 
step of identifying calibration outliers includes selecting the magnitude by 
determining a probability that each member of the set of calibration 
samples belongs to a class defined by a preselected probability 
distribution function whereby calibration outliers are identified as 
calibration samples whose class membership may be rejected at a 
confidence level greater than a preselected level, and wherein the step of 
identifying a sample outlier includes determining whether probability of 
class membership of the unknown sample may be rejected at a 
confidence level greater than the preselected level, according to the 
model. 

19. The method as set forth in claim 18, wherein the probability 
distribution function is formed using an algorithm selected from the group 
consisting of chi-squared distribution function evaluation and Hotelling's 
T-statistic evaluation. 

20. The method as set forth in claim 18, wherein the 
preselected level is in the range of approximately 3 to 5 standard 
deviations as defined by the probability distribution function. 

21. The method as set forth in claim 12, wherein the unknown 
sample and each of the calibration samples includes a second analyte 
having concentration within a preselected range. 

22. The method as set forth in claim 21, wherein the second 
analyte is triglycerides. 

23. The method as set forth in claim 22, wherein the second 
analyte is total protein. 

24. The method as set forth in claims 1, 2, 11 or 12, wherein 
the step of constructing a calibration model includes removing redundant 
data from data corresponding to the reduced set of calibration samples. 

25. The method as set forth in claims 1, 2, 11 or 12, wherein 
the step of constructing a calibration model is performed using an 
algorithm selected from the group consisting of principal component 
regression, partial least squares, multiple linear regression, and artificial 
neural networks. 

26. The method as set forth in claims 1, 2, 11 or 12, wherein 
the step of constructing a calibration model is performed using an 
algorithm selected from the group consisting of principal component 
regression, partial least squares, and multiple linear regression, and 
includes selecting an optimal number of score vectors to use in the 
calibration model whereby redundant data may be removed from data 
corresponding to the reduced set of calibration samples. 

27. The method as set forth in claim 26 wherein the step of 
selecting the optimal number of score vectors includes: 

constructing n preliminary calibration models, each preliminary 
calibration model using a different number of score vectors selected from 
a range of 1 through n; 

determining a standard error of prediction for each of the 
preliminary calibration models; and 

comparing the standard error of prediction for the preliminary 
models to determine the optimal number of score vectors. 

28. The method as set forth in claim 27 wherein comparing the 
standard error of prediction is performed using an algorithm selected from 
the group consisting of F-test and local minimum determination. 
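The selection procedure of claims 27 and 28 can be illustrated with principal component regression and the local-minimum criterion. In this sketch the standard error of prediction is taken as the root mean squared prediction error on a separate validation set, which is one common definition and an assumption here; the function names are hypothetical.

```python
import numpy as np

def sep_by_component_count(X_cal, y_cal, X_val, y_val, max_components):
    """Standard error of prediction for models using 1..max_components score vectors."""
    X_cal, y_cal = np.asarray(X_cal, dtype=float), np.asarray(y_cal, dtype=float)
    X_val, y_val = np.asarray(X_val, dtype=float), np.asarray(y_val, dtype=float)
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    U, s, Vt = np.linalg.svd(X_cal - x_mean, full_matrices=False)
    seps = []
    for k in range(1, max_components + 1):
        # Scores are U * s and mutually orthogonal, so the regression has a closed form.
        gamma = (U[:, :k].T @ (y_cal - y_mean)) / s[:k]
        b = Vt[:k].T @ gamma
        residual = (X_val - x_mean) @ b + y_mean - y_val
        seps.append(float(np.sqrt(np.mean(residual ** 2))))
    return seps

def first_local_minimum(seps):
    """Number of score vectors at the first local minimum of the SEP curve."""
    for k in range(len(seps) - 1):
        if seps[k] <= seps[k + 1]:
            return k + 1
    return len(seps)
```

`max_components` should not exceed the number of non-negligible singular values, since the closed-form regression divides by them.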

29. The method as set forth in claims 2 or 12, wherein the step 
of dividing the set of eigenvalues includes determining the number of 
primary eigenvalues q by an iterative method which compares variance 
of the qth eigenvalue to the variance of the pooled error eigenvalues using 
an F-test. 

30. The method as set forth in claim 29, wherein the step of 
determining the number of primary eigenvalues q includes weighing the 
eigenvalues by an amount proportional to information explained by 
associated eigenvectors to produce a set of reduced eigenvalues. 
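Claims 29 and 30 do not give the weighting formula. A sketch under the assumption that the reduced eigenvalues follow the form common in factor analysis of chemical data, where the j-th eigenvalue of an r x c matrix is divided by (r - j + 1)(c - j + 1), is shown below. The critical F value is supplied by the caller, for example from an F table with 1 and s - q degrees of freedom; all names are hypothetical.

```python
def reduced_eigenvalue_f(eigenvalues, n_rows, n_cols, q):
    """F statistic comparing the q-th reduced eigenvalue (1-based) with the pooled error ones."""
    s = len(eigenvalues)

    def weight(j):
        # Assumed weighting: proportional to the information left for component j.
        return (n_rows - j + 1) * (n_cols - j + 1)

    error_indices = range(q + 1, s + 1)
    pooled = (sum(eigenvalues[j - 1] for j in error_indices)
              / sum(weight(j) for j in error_indices))
    return (eigenvalues[q - 1] / weight(q)) / pooled

def count_primary_eigenvalues(eigenvalues, n_rows, n_cols, critical_value):
    """Iterate from the smallest eigenvalue upward until one is significant."""
    s = len(eigenvalues)
    for q in range(s - 1, 0, -1):
        # critical_value(dof) returns the F threshold for (1, dof) degrees of freedom.
        if reduced_eigenvalue_f(eigenvalues, n_rows, n_cols, q) > critical_value(s - q):
            return q
    return 0
```

The eigenvalues are expected in decreasing order, and the pooled error eigenvalues must not all be zero.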

31. Apparatus for determining concentration of an analyte in an 
unknown sample of a biological fluid of a mammal comprising: 

a positioner unit capable of sequentially positioning the unknown 
sample and each of a set of calibration samples of the biological fluid 
collected from a plurality of sources; 

a radiation emitter capable of emitting near-infrared 
electromagnetic radiation at a preselected plurality of wavelengths, said 
radiation emitter positioned to sequentially direct radiation of each of the 
wavelengths into and partially through each of the calibration samples 
and the unknown sample; 

a near-infrared electromagnetic radiation detector disposed to 
sequentially receive and measure intensity of the radiation transmitted 
through each of the calibration samples at each of the wavelengths to 
form a set of calibration data and through the unknown sample to form a 
set of sample data; and 

a computer connected to said detector and having a general 
purpose microprocessor configured with computer program code to form 
the set of calibration data into a matrix defining a space, form a subspace 
of the space wherein sources of relatively greater variations within the 
set of calibration data are represented, project the set of calibration data 
into the subspace, determine a generalized distance within the subspace 
between each calibration sample and a centroid defined by a distribution 
formed by the set of calibration samples, identify calibration outliers as 
those calibration samples having a generalized distance greater than a 
preselected magnitude, form a reduced set of calibration samples from 
calibration samples remaining after removal of calibration outliers, 
construct a calibration model from the reduced set of calibration samples 
to predict concentration of the analyte in the unknown sample, project 
the set of sample data into a space defined by the model, determine a 
generalized distance for the unknown sample according to the model, 
identify the unknown sample as a sample outlier provided the generalized 
distance of the unknown sample is greater than the preselected 
magnitude, and predict concentration of the analyte in the unknown 
sample according to the model provided the generalized distance of the 
unknown sample is not greater than the preselected magnitude. 

32. The apparatus of claim 31, wherein said positioner unit 
comprises: 

a flowcell having an input orifice and an output orifice; and 

a pump disposed in fluid connection between said input orifice and 
said output orifice whereby each of the set of calibration samples and the 
unknown sample may be sequentially circulated through said flowcell. 

33. The apparatus of claim 31, further comprising a temperature 
controller capable of controlling temperature of said positioner unit and 
said detector. 

34. The apparatus of claim 31, wherein each of the generalized 
distances is a Mahalanobis distance determined from the following 
relationship: 

$$MD_i = \left[ (x_i - \bar{x})\, S^{-1}\, (x_i - \bar{x})^{T} \right]^{1/2}$$

wherein $MD_i$ is the Mahalanobis distance between an $i$th calibration 
sample and the centroid $\bar{x}$ of the set of calibration samples, $S^{-1}$ is the 
inverted variance-covariance matrix of the set of calibration data, and 
$(x_i - \bar{x})^{T}$ is the transpose of $(x_i - \bar{x})$. 

35. The apparatus of claim 31, wherein each generalized 
distance is a Robust distance determined using an algorithm selected 
from the group consisting of minimum volume ellipsoid estimator and 
projection algorithm. 

36. The apparatus of claim 31, further comprising a noise 
reducer coupled to said radiation emitter and said detector, and capable 
of reducing noise in measurements of intensity of that portion of the 
radiation transmitted through each of the calibration samples and the 
unknown sample. 

37. Apparatus for determining concentration of an analyte in an 
unknown sample of a biological fluid of a mammal comprising: 

a positioner unit capable of sequentially positioning the unknown 
sample and each of a set of calibration samples of the biological fluid 
collected from a plurality of sources, including a flowcell having an input 
orifice and an output orifice, and a pump disposed in fluid connection 
between said input orifice and said output orifice whereby each of the set 
of calibration samples and the unknown sample may be sequentially 
circulated through said flowcell; 

a radiation emitter capable of emitting near-infrared 
electromagnetic radiation at a preselected plurality of wavelengths, said 
radiation emitter positioned to sequentially direct the radiation of each of 
the wavelengths into and partially through each of the calibration 
samples and the unknown sample; 

a near-infrared electromagnetic radiation detector disposed to 
sequentially receive and measure intensity of the radiation transmitted 
through each of the calibration samples at each of the wavelengths to 
form a set of calibration data and through the unknown sample to form a 
set of sample data; 

a temperature controller capable of controlling temperature of said 
positioner unit and said detector; 

a noise reducer coupled to said radiation emitter and said detector, 
and capable of reducing noise in measurements of intensity of that 
portion of the radiation transmitted through each of the calibration 
samples and the unknown sample; and 

a computer connected to said detector and having a general 
purpose microprocessor configured with computer program code to form 
the set of calibration data into a matrix defining a space, form a subspace 
of the space wherein sources of relatively greater variations within the 
set of calibration data are represented, project the set of calibration data 
into the subspace, determine a generalized distance within the subspace 
between each calibration sample and a centroid defined by a distribution 
formed by the set of calibration samples, identify calibration outliers as 
those calibration samples having a generalized distance greater than a 
preselected magnitude, form a reduced set of calibration samples from 
calibration samples remaining after removal of calibration outliers, 
construct a calibration model from the reduced set of calibration samples 
to predict concentration of the analyte in the unknown sample, project 
the set of sample data into a space defined by the model, determine a 
generalized distance for the unknown sample according to the model, 
identify the unknown sample as a sample outlier provided the generalized 
distance of the unknown sample is greater than the preselected 
magnitude, and predict concentration of the analyte in the unknown 
sample according to the model provided the generalized distance of the 
unknown sample is not greater than the preselected magnitude. 

38. The apparatus of claim 37, wherein each of the generalized 
distances is a Mahalanobis distance determined from the following 
relationship: 

$$MD_i = \left[ (x_i - \bar{x})\, S^{-1}\, (x_i - \bar{x})^{T} \right]^{1/2}$$

wherein $MD_i$ is the Mahalanobis distance between an $i$th calibration 
sample and the centroid $\bar{x}$ of the set of calibration samples, $S^{-1}$ is the 
inverted variance-covariance matrix of the set of calibration data, and 
$(x_i - \bar{x})^{T}$ is the transpose of $(x_i - \bar{x})$. 

39. The apparatus of claim 37, wherein each generalized 
distance is a Robust distance determined using an algorithm selected 
from the group consisting of minimum volume ellipsoid estimator and 
projection algorithm. 

40. The apparatus of claims 36, 38, or 39, wherein: 
said radiation emitter includes a relatively broad bandwidth near- 
infrared electromagnetic radiation source and a monochrometer disposed 
between said source and said positioner unit; and 

said noise reducer includes a chopper disposed between said 
source and said monochrometer whereby radiation from said source may 
be alternatively blocked from transmission to said monochrometer, and 
a synchronizer operably connected to said chopper and said detector 
whereby signals produced in said detector when radiation from said 
source is blocked by said chopper may be subtracted from signals 
produced in said detector when radiation from said source is not blocked 
by said chopper. 
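The light and dark subtraction described for the chopper and synchronizer can be illustrated numerically. The sketch below assumes the synchronizer supplies, for each detector reading, a flag saying whether the chopper was open; the function name and data layout are hypothetical.

```python
def chopper_corrected_intensity(detector_signal, chopper_open):
    """Mean signal with the source unblocked minus mean signal with it blocked."""
    light = [v for v, is_open in zip(detector_signal, chopper_open) if is_open]
    dark = [v for v, is_open in zip(detector_signal, chopper_open) if not is_open]
    if not light or not dark:
        raise ValueError("need at least one light and one dark reading")
    return sum(light) / len(light) - sum(dark) / len(dark)
```

Subtracting the dark readings removes detector offset and stray background that are present whether or not the source is blocked.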

41. The apparatus of claims 36, 38, or 39, wherein: 
said radiation emitter includes a relatively broad bandwidth near- 
infrared electromagnetic radiation source and a filter wheel disposed 
between said source and said positioner unit; and 

said noise reducer includes a chopper disposed between said 
source and said monochrometer whereby radiation from said source may 
be alternatively blocked from transmission to said monochrometer, and 
a synchronizer operably connected to said chopper and said detector 
whereby signals produced in said detector when radiation from said 
source is blocked by said chopper may be subtracted from signals 
produced in said detector when radiation from said source is not blocked 
by said chopper. 

42. The apparatus of claims 36, 38, or 39, wherein: 
said radiation emitter includes a plurality of relatively narrow 
bandwidth near-infrared electromagnetic radiation sources connected to 
said computer whereby said sources may be activated in a preselected 
sequential order; and 

said noise reducer includes a pulse driver operably connected to 
each of said sources and said detector whereby signals produced in said 
detector when radiation from said sources is not pulsed by said driver 
may be subtracted from signals produced in said detector when radiation 
from said sources is pulsed by said driver. 

18. Collect data from samples where concentration of analyte of interest is unknown. 
19. Apply pretreatment techniques used during the calibration phase. 
20. Project data into principal component space defined by the calibration model. 
21. Compute generalized distance and estimate probability of class membership. 